In this tutorial we show several ways to add a fitted line to a scatter plot using popular Python visualization packages. Before a fitted line can be drawn, it has to be estimated with a regression method. You can either compute the line yourself or rely on built-in tools, and we explore both approaches. With Matplotlib and Bokeh we plot fitted lines obtained from our own functions, while with Plotly and Seaborn we use their built-in estimators. We also demonstrate how to add interactivity to plots with Bokeh and Plotly. Since we build similar plots with different tools, the examples also let us compare those tools in terms of ease of use and flexibility. In short, the content is organised as follows: imports, data generation, fitting functions, and then one section each for Matplotlib, Bokeh, Plotly and Seaborn.
As mentioned above, we use four popular visualization tools. The imports below bring in everything needed to complete the task. Some of the functions imported here or defined later are helpers that keep the code and the visualized output better organized.
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
from plotly.subplots import make_subplots
from bokeh.layouts import gridplot
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.models.tools import CustomJSHover
from bokeh.resources import INLINE
from bokeh.io import output_notebook
import pandas as pd
import numpy as np
# Set a fixed random seed to get the same results across executions
np.random.seed(42)
We generate a linear, a quadratic and a fifth-order polynomial function on a grid of 20 values in the range from -1 to 1, each with the same uniform noise added.
# Create a sample of 20 values on axis X
X = np.linspace(-1, 1, 20)
# Draw an array of error terms in the range [-1, 1), one per X value
delta = np.random.uniform(-1, 1, X.size)
# Calculate y values of different orders
y_linear = 0.4*X + 3 + delta
y_quadratic = X**2 + 0.4*X + 3 + delta
y_poly = X**5 + X**2 + 0.4*X + 3 + delta
# Display created Dataset
df = pd.DataFrame({'X':X,'y_linear':y_linear,'y_quadratic':y_quadratic,'y_poly':y_poly})
df
| | X | y_linear | y_quadratic | y_poly |
|---|---|---|---|---|
| 0 | -1.000000 | 2.349080 | 3.349080 | 2.349080 |
| 1 | -0.894737 | 3.543534 | 4.344088 | 3.770663 |
| 2 | -0.789474 | 3.148198 | 3.771467 | 3.464785 |
| 3 | -0.684211 | 2.923633 | 3.391777 | 3.241826 |
| 4 | -0.578947 | 2.080458 | 2.415638 | 2.350596 |
| 5 | -0.473684 | 2.122515 | 2.346892 | 2.323044 |
| 6 | -0.368421 | 1.968799 | 2.104533 | 2.097745 |
| 7 | -0.263158 | 3.627089 | 3.696341 | 3.695079 |
| 8 | -0.157895 | 3.139072 | 3.164003 | 3.163905 |
| 9 | -0.052632 | 3.395093 | 3.397863 | 3.397862 |
| 10 | 0.052632 | 2.062222 | 2.064992 | 2.064992 |
| 11 | 0.157895 | 4.002978 | 4.027908 | 4.028006 |
| 12 | 0.263158 | 3.770148 | 3.839401 | 3.840663 |
| 13 | 0.368421 | 2.572047 | 2.707781 | 2.714568 |
| 14 | 0.473684 | 2.553124 | 2.777500 | 2.801348 |
| 15 | 0.578947 | 2.598388 | 2.933568 | 2.998610 |
| 16 | 0.684211 | 2.882169 | 3.350313 | 3.500264 |
| 17 | 0.789474 | 3.365302 | 3.988571 | 4.295253 |
| 18 | 0.894737 | 3.221785 | 4.022339 | 4.595764 |
| 19 | 1.000000 | 2.982458 | 3.982458 | 4.982458 |
Below we define two functions that compute the fitted line from our data. The first estimates the coefficients of the original equations with the least squares method, implemented from scratch. The second relies on NumPy's built-in np.polyfit(), which is also a least squares method and returns the vector of coefficients that minimises the squared error.
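To make the from-scratch approach concrete, the least squares problem can be written out explicitly. This is a sketch of the standard normal equations for a polynomial of order $p$ fitted to $N$ points. The symbols $c_k$, $M$ and $b$ are introduced here for illustration: $M$ corresponds to the matrix and $b$ to the `bias` array built in the function below.

$$
\hat{y}(x) = \sum_{k=0}^{p} c_k x^k, \qquad S(c) = \sum_{i=1}^{N} \Big( y_i - \sum_{k=0}^{p} c_k x_i^k \Big)^2
$$

Setting $\partial S / \partial c_j = 0$ for each $j = 0, \dots, p$ gives

$$
\sum_{k=0}^{p} \Big( \sum_{i=1}^{N} x_i^{\,j+k} \Big) c_k = \sum_{i=1}^{N} x_i^{\,j} y_i
$$

which is the linear system

$$
M c = b, \qquad M_{jk} = \sum_{i=1}^{N} x_i^{\,j+k}, \qquad b_j = \sum_{i=1}^{N} x_i^{\,j} y_i, \qquad c = M^{-1} b
$$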
# Define a function that approximates the y line using linear algebra
def get_fitted_line_from_scratch(X, y, order):
    n = order + 1
    # Build the matrix of the normal equations: M[j, i] = sum(X**(i + j))
    M = np.array([np.sum(X**(i+j)) for j in range(n) for i in range(n)]).reshape((n,n))
    # Build the right-hand side vector: bias[i] = sum(X**i * y)
    bias = np.array([np.sum(X**(i)*y) for i in range(n)]).reshape((n,1))
    # Calculate coefficients
    coefs = np.dot(np.linalg.inv(M),bias)
    # Calculate matrix of powers for inputs
    X = np.transpose(np.array([np.power(X,i) for i in range(n)]).reshape((n,-1)))
    # Calculate fitted line values and flatten them to a 1-D array
    fitted_line = X.dot(coefs).ravel()
    return fitted_line
# Define a function that approximates the y line using NumPy built-in functions
def get_fitted_line_using_numpy(X, y, order):
    n = order + 1
    # Calculate coefficients using the polyfit function from the numpy package
    coefs = np.polyfit(X, y, order)
    # Flip the order of values so that the lowest power comes first
    coefs = np.flip(coefs,0)
    # Calculate matrix of powers for inputs
    X = np.transpose(np.array([np.power(X,i) for i in range(n)]).reshape((n,-1)))
    # Calculate fitted line values
    fitted_line = X.dot(coefs)
    return fitted_line
# Calculate fitted lines
df['y_linear_fitted_scratch'] = get_fitted_line_from_scratch(X, y_linear, order = 1)
df['y_quadratic_fitted_scratch'] = get_fitted_line_from_scratch(X, y_quadratic, order = 2)
df['y_poly_fitted_scratch'] = get_fitted_line_from_scratch(X, y_poly, order = 5)
df['y_linear_fitted_numpy'] = get_fitted_line_using_numpy(X, y_linear, 1)
df['y_quadratic_fitted_numpy'] = get_fitted_line_using_numpy(X, y_quadratic, 2)
df['y_poly_fitted_numpy'] = get_fitted_line_using_numpy(X, y_poly, 5)
The first visualization tool we use is Matplotlib. Below we define a grid of six subplots showing linear, quadratic and polynomial fitted lines computed from scratch and with the help of NumPy. After initializing a figure and axes and setting the spacing between plots, we define the __visualize_using_matplotlib__ function, which covers all the necessary steps. First we add the scatter plot and the fitted line using Matplotlib's __scatter__ and __plot__ functions. Then we add some description to make the plots comprehensible: in this example we set the __x__ and __y__ labels and add a legend naming the method used to approximate the fitted line.
# Create six axes and access them through the returned array
fig, axs = plt.subplots(2, 3, figsize=(15,10))
# Set spacing values between plots
plt.subplots_adjust(top = 0.99, bottom=0.01, hspace=0.25, wspace=0.4)
def visualize_using_matplotlib(axis, df, X, y, y_fitted, title, method):
    # Add scatterplot to axis
    axs[axis].scatter(df[X], df[y])
    # Add fitted line to axis
    axs[axis].plot(df[X], df[y_fitted], color='red', label=method)
    # Set title
    axs[axis].set_title(title)
    # Set xlabel
    axs[axis].set_xlabel('X')
    # Set ylabel
    axs[axis].set_ylabel('y')
    # Add legend
    axs[axis].legend()
# Create plots for fitted line created from scratch
fitted_line_method = 'scratch'
visualize_using_matplotlib((0,0), df, 'X', 'y_linear', 'y_linear_fitted_scratch', 'linear', fitted_line_method)
visualize_using_matplotlib((0,1), df, 'X', 'y_quadratic', 'y_quadratic_fitted_scratch', 'quadratic', fitted_line_method)
visualize_using_matplotlib((0,2), df, 'X', 'y_poly', 'y_poly_fitted_scratch', 'poly', fitted_line_method)
# Create plots for fitted lines created using numpy
fitted_line_method = 'numpy'
visualize_using_matplotlib((1,0), df, 'X', 'y_linear', 'y_linear_fitted_numpy', 'linear', fitted_line_method)
visualize_using_matplotlib((1,1), df, 'X', 'y_quadratic', 'y_quadratic_fitted_numpy', 'quadratic', fitted_line_method)
visualize_using_matplotlib((1,2), df, 'X', 'y_poly', 'y_poly_fitted_numpy', 'poly', fitted_line_method)
plt.tight_layout()
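The fitted values above are computed only at the 20 sample points, so the fifth-order curve is drawn as straight segments between them. One possible refinement, sketched here under the assumption that the coefficients come from `np.polyfit`, is to evaluate the polynomial on a denser grid before plotting. The names `X_dense` and `y_dense` and the choice of 200 points are illustrative.

```python
import numpy as np

# Recreate the noisy fifth-order data from the tutorial (same seed and grid)
np.random.seed(42)
X = np.linspace(-1, 1, 20)
delta = np.random.uniform(-1, 1, X.size)
y_poly = X**5 + X**2 + 0.4*X + 3 + delta

# Fit a fifth-order polynomial to the 20 noisy points
coefs = np.polyfit(X, y_poly, 5)

# Evaluate the fitted polynomial on a denser grid over the same range
X_dense = np.linspace(X.min(), X.max(), 200)
y_dense = np.polyval(coefs, X_dense)
```

Passing `X_dense` and `y_dense` to the `plot` call in place of the DataFrame columns would draw a smoother curve, while the `scatter` call would still use the original 20 points.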
__Bokeh__ is a widely used visualization tool that creates interactive plots and can export them to __html__ format without losing the interactivity of the rendered visualizations. To demonstrate this interactivity, we add a __HoverTool__ in this example. The tool shows information about the plotted data when the cursor hovers over a point on the plot.
As the basis of most Bokeh visualizations, we define a __ColumnDataSource__, which provides the data to the glyphs of the plot. Then we initialize the figure object, where we set the size, title and other general parameters of the plot.
Next we create a scatter plot using the __circle__ method, which configures and adds __Scatter__ glyphs. We pass a reference to our data through the source parameter of this method. The __x__ and __y__ parameters define the centers of the circles on the plot, based on the data received from the data source. The __line__ glyph draws the fitted line provided by the data source.
Finally, we add a __HoverTool__ using the add_tools method. The structure of this tool is fairly complex. By default, the hover tool displays informational tooltips whenever the cursor is directly over a glyph. The data to show comes from the glyph's data source, and what to display is configurable with the tooltips property, which maps display names to columns in the data source or to special known variables. Here we also use 'CustomJSHover', which allows us to create a custom formatter to apply to a hover tool field. In this custom formatter we retrieve the data-space x and y coordinates of the hovered glyph.
def visualize_using_bokeh(df, X, y, y_fitted, title, method):
    # Wrap the DataFrame in a data source shared by all glyphs
    source = ColumnDataSource(df)
    # Define figure parameters
    plot = figure(width=400, height=400, x_axis_label='X', y_axis_label='y', title=title)
    # Add circles to form scatterplot
    plot.circle(x=X, y=y, size=10, color="navy", alpha=0.5, source=source)
    # Add regression line
    plot.line(x=X, y=y_fitted, line_width=3, source=source, color='red', legend_label=method)
    # Add a hover tool that shows the data-space coordinates of the hovered glyph
    plot.add_tools(HoverTool(show_arrow=False, tooltips=[('X', '$data_x'), ('y', '$data_y')],
                             formatters=dict(
                                 X=CustomJSHover(code="""return '' + special_vars.data_x"""),
                                 y_linear=CustomJSHover(code="""return '' + special_vars.data_y""")
                             )
                             ))
    return plot
# Create plots for fitted line created from scratch
fitted_line_method = 'scratch'
plot_linear_scratch = visualize_using_bokeh(df, 'X', 'y_linear', 'y_linear_fitted_scratch', 'linear', fitted_line_method)
plot_quadratic_scratch = visualize_using_bokeh(df, 'X', 'y_quadratic', 'y_quadratic_fitted_scratch', 'quadratic', fitted_line_method)
plot_poly_scratch = visualize_using_bokeh(df, 'X', 'y_poly', 'y_poly_fitted_scratch', 'poly', fitted_line_method)
# Create plots for fitted line created using numpy
fitted_line_method = 'numpy'
plot_linear_numpy = visualize_using_bokeh(df, 'X', 'y_linear', 'y_linear_fitted_numpy', 'linear', fitted_line_method)
plot_quadratic_numpy = visualize_using_bokeh(df, 'X', 'y_quadratic', 'y_quadratic_fitted_numpy', 'quadratic', fitted_line_method)
plot_poly_numpy = visualize_using_bokeh(df, 'X', 'y_poly', 'y_poly_fitted_numpy', 'poly', fitted_line_method)
# provide minified BokehJS in order to embed visualization in output of this cell
output_notebook(INLINE)
# Put all the plots in a grid layout
p = gridplot([[plot_linear_scratch, plot_quadratic_scratch, plot_poly_scratch],
[plot_linear_numpy, plot_quadratic_numpy , plot_poly_numpy]])
# Show the results
show(p)
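As noted above, the tooltips property can also map display names directly to columns of the data source. The following is a minimal sketch of that variant, separate from the function above. It assumes a Bokeh version in which `figure` accepts `width` and `height`. The three-row DataFrame and the `{0.00}` number format are illustrative choices, not part of the tutorial's data.

```python
import pandas as pd
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.plotting import figure

# Illustrative data, not the tutorial dataset
df_demo = pd.DataFrame({"X": [-1.0, 0.0, 1.0], "y_linear": [2.6, 3.0, 3.4]})
source = ColumnDataSource(df_demo)

plot = figure(width=400, height=400, x_axis_label="X", y_axis_label="y", title="linear")
plot.scatter(x="X", y="y_linear", size=10, color="navy", alpha=0.5, source=source)

# "@name" refers to a column of the data source; {0.00} formats to two decimals
hover = HoverTool(tooltips=[("X", "@X{0.00}"), ("y", "@y_linear{0.00}")])
plot.add_tools(hover)
```

With column references, each tooltip shows the stored value of the hovered point, and no custom formatter is needed.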
Another visualization package that provides interactive plots is __Plotly__. For our example we can achieve the same functionality with fewer lines of code than with __Bokeh__. The main tool we use is the scatter function from the express module of __Plotly__. Every __Plotly__ Express function uses graph objects internally and returns a __plotly.graph_objects.Figure__ instance.
In our use case we are interested in a high-level feature of this module called __trendline__. This feature provides built-in estimators, so we can fit a line to our data without any side computations. We only need to pass the alias of the regression method we want to apply. For the linear data we use an OLS-based trendline. For the quadratic and polynomial cases we use the built-in __LOWESS__ estimator. For a detailed description of these methods, refer to the official documentation.
To arrange the plots in a row we use the __make_subplots__ function. Please note that the __px.scatter__ function returns a __plotly.graph_objects.Figure__, which encapsulates all the objects displayed on the plot. To add them to the subplots, we need to pass its __Scatter__ objects one by one.
It is also possible to build the same visualization with a single call to __px.scatter__. In this case you need to melt the dataframe into the shape [['X','variable','values']]. 'X' contains the same values as in the original dataset, except that they are repeated for each category of y data [[y_linear, y_quadratic, y_poly]]. The 'variable' column holds the categories [[y_linear, y_quadratic, y_poly]] and the 'values' column holds their corresponding values. Additionally, the parameter facet_col='variable' should be passed to the scatter function, and trendline_scope should be set to 'trace' (currently the default). The drawback of this method is that only one trendline function can be selected, and it is applied to all categories of the data.
fig = make_subplots(rows=1, cols=3,subplot_titles=['linear','quadratic','poly'], x_title='X', y_title='y')
# Pair each y column with its trendline method; the position gives the subplot column
panels = [('y_linear', 'ols'), ('y_quadratic', 'lowess'), ('y_poly', 'lowess')]
for col, (y_name, trendline) in enumerate(panels, start=1):
    # px.scatter returns a Figure holding the scatter trace and the trendline trace
    scatter_fig = px.scatter(df, x="X", y=y_name, trendline=trendline, hover_data=[y_name], trendline_color_override="red")
    # Add the traces to the subplot one by one
    for trace in scatter_fig.data:
        fig.add_trace(trace, row=1, col=col)
fig.update_layout(height=600, width=1500)
fig.show()
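The single-call alternative described above depends on reshaping the data to long format first. Below is a minimal sketch of that step using `DataFrame.melt`, with the data regenerated so the block runs on its own. The helper `make_faceted_figure` is a hypothetical name introduced for illustration. It imports Plotly lazily and assumes a Plotly version that supports `trendline_scope`, plus statsmodels for the trendline computation.

```python
import numpy as np
import pandas as pd

# Recreate the tutorial dataset (same seed and grid)
np.random.seed(42)
X = np.linspace(-1, 1, 20)
delta = np.random.uniform(-1, 1, X.size)
df = pd.DataFrame({
    "X": X,
    "y_linear": 0.4*X + 3 + delta,
    "y_quadratic": X**2 + 0.4*X + 3 + delta,
    "y_poly": X**5 + X**2 + 0.4*X + 3 + delta,
})

# Reshape to long format: X is repeated once per category of y data
long_df = df.melt(
    id_vars="X",
    value_vars=["y_linear", "y_quadratic", "y_poly"],
    var_name="variable",
    value_name="values",
)


def make_faceted_figure(long_df, trendline="lowess"):
    """Hypothetical helper: one px.scatter call with one facet per category."""
    import plotly.express as px  # lazy import, so the reshaping above needs only pandas
    return px.scatter(
        long_df, x="X", y="values", facet_col="variable",
        trendline=trendline, trendline_scope="trace",
        trendline_color_override="red",
    )
```

Calling `make_faceted_figure(long_df).show()` would render the three panels, all with the same trendline method, which is the limitation mentioned above.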
Seaborn is a visualization package built on top of __matplotlib__ and closely integrated with __pandas__. Like matplotlib, it does not provide prebuilt interactive tools that can be exported to __html__ format. In contrast to matplotlib, seaborn focuses mostly on statistical graphics and provides built-in statistical computation. In our example we use the __regplot__ function, which lets us fit a line to our data by passing only three parameters (__x__, __y__ and the order of the polynomial).
# set color_code and font scale
sns.set_theme(color_codes=True, font_scale = 1.5)
pix = 1/plt.rcParams['figure.dpi'] # pixel in inches
fig, axes = plt.subplots(1, 3, figsize=(1500*pix, 600*pix))
# Define a dict of arguments shared by all plots:
# fitted line color, random seed, truncation of the fit to the data range, and no confidence interval
common_args={'line_kws': {"color": "red"}, 'seed': 42, 'truncate':True, 'ci':None}
# Add plots into axes
sns.regplot(ax=axes[0], x=df['X'], y=df['y_linear'], order=1, **common_args).set_title('linear')
sns.regplot(ax=axes[1], x=df['X'], y=df['y_quadratic'], order=2, **common_args).set_title('quadratic')
sns.regplot(ax=axes[2], x=df['X'], y=df['y_poly'], order=5, **common_args).set_title('poly')
# Set labels for x and y
for i in range(3):
    axes[i].set(xlabel='X', ylabel='y')
plt.tight_layout()